NSF PAR Search | NSF Public Access Repository

"My Very Subjective Human Interpretation": Domain Expert Perspectives on Navigating the Text Analysis Loop for Topic Models

https://doi.org/10.1145/3701201

Schofield, Alexandra; Wu, Siqi; Bayard de Volo, Theo; Kuze, Tatsuki; Gomez, Alfredo; Sultana, Sharifa (January 2025, Proceedings of the ACM on Human-Computer Interaction - GROUP)

Practitioners dealing with large text collections frequently use topic models such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) in their projects to explore trends. Despite twenty years of accrued advancement in natural language processing tools, these models are found to be slow and challenging to apply to text exploration projects. In our work, we engaged with practitioners (n=15) who use topic modeling to explore trends in large text collections to understand their project workflows and investigate which factors often slow down the processes and how they deal with such errors and interruptions in automated topic modeling. Our findings show that practitioners are required to diagnose and resolve context-specific problems with preparing data and models and need control for these steps, especially for data cleaning and parameter selection. Our major findings resonate with existing work across CSCW, computational social science, machine learning, data science, and digital humanities. They also leave us questioning whether automation is actually a useful goal for tools designed for topic models and text exploration.
Full Text Available
Introducing tSLDA: A Workflow-Oriented Topic Modeling Tool.

Babb, Simon; Celeste, Mia; Harris, Dana; Wu, Ingrid; Bayard de Volo, Theo; Gomez, Alfredo; Kuze, Tatsuki; Lee, Taeyun; Mimno, David; Schofield, Alexandra (October 2021, WeCNLP (West Coast NLP) Summit)

Full Text Available
Combatting The Challenges of Local Privacy for Distributional Semantics with Compression

Schofield, Alexandra; Yauney, Gregory; Mimno, David (December 2019, PriML workshop at NeurIPS)

Traditional methods for adding locally private noise to bag-of-words features overwhelm the true signal in the text data, removing the properties of sparsity and non-negativity often relied upon by distributional semantic models. We argue the formulation of limited-precision local privacy, which guarantees privacy between documents of less than a user-specified maximum distance, is a more appropriate framework for bag-of-words features. To reduce the number of features to which we must add random noise, we also compress word features before adding noise, then decompress those features before model inference. We test randomized methods of aggregation as well as methods informed by distributional properties of words. Applying LDA and LSA to synthetic and real data, we show that these approaches produce distributional models closer to those in the original data.
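The compress, add noise, decompress sequence described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of Laplace noise, the random bucket assignment, and the uniform decompression rule are all assumptions made here for illustration, and the sketch makes no attempt to model the limited-precision privacy guarantee itself.

```python
import random


def sample_laplace(scale, rng):
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    # Laplace noise is an assumption; the abstract does not name a distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def assign_buckets(vocab_size, num_buckets, rng):
    # Randomized aggregation: map every word id to one of `num_buckets` groups.
    return [rng.randrange(num_buckets) for _ in range(vocab_size)]


def compress(counts, buckets, num_buckets):
    # Sum the counts of all words that share a bucket.
    compressed = [0.0] * num_buckets
    for word_id, count in enumerate(counts):
        compressed[buckets[word_id]] += count
    return compressed


def add_noise(compressed, scale, rng):
    # Noise is added to num_buckets values instead of vocab_size values.
    return [value + sample_laplace(scale, rng) for value in compressed]


def decompress(values, buckets, num_buckets):
    # Hypothetical decompression rule: spread each bucket's total evenly
    # over the words assigned to that bucket.
    sizes = [0] * num_buckets
    for bucket in buckets:
        sizes[bucket] += 1
    return [values[bucket] / sizes[bucket] for bucket in buckets]


rng = random.Random(0)
counts = [3, 0, 1, 0, 2, 0, 0, 4]          # toy bag-of-words vector
buckets = assign_buckets(len(counts), 3, rng)
compressed = compress(counts, buckets, 3)
noisy = add_noise(compressed, scale=1.0, rng=rng)
restored = decompress(noisy, buckets, 3)    # back to vocabulary size
```

In this toy setup the noise is drawn three times rather than eight, which is the point of compressing first. The cost is that words sharing a bucket become indistinguishable after decompression.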
Full Text Available
Quantifying the Effects of Text Duplication on Semantic Models

Schofield, Alexandra; Magnusson, Mans; Mimno, David (January 2018, Empirical Methods in Natural Language Processing)

Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document repetition. By artificially creating different forms of duplicate text we confirm several hypotheses about how repeated text impacts models. While a small amount of duplication is tolerable, substantial over-representation of subsets of the text may overwhelm meaningful topical patterns.
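One simple way to create duplication artificially is to repeat a random subset of documents a fixed number of times. The sketch below covers only exact whole-document repetition; the abstract mentions several forms of duplicate text without listing them, so the function name `duplicate_subset` and its `fraction` and `copies` parameters are hypothetical and do not reproduce the paper's experimental setup.

```python
import random


def duplicate_subset(docs, fraction, copies, seed=0):
    """Return a new corpus where a random `fraction` of `docs`
    appears `copies` additional times (exact duplicates only)."""
    rng = random.Random(seed)
    num_chosen = round(len(docs) * fraction)
    chosen = rng.sample(range(len(docs)), num_chosen)
    augmented = list(docs)
    for index in chosen:
        augmented.extend([docs[index]] * copies)
    # Shuffle so the duplicates are not clustered at the end of the corpus.
    rng.shuffle(augmented)
    return augmented


corpus = ["doc a", "doc b", "doc c", "doc d"]
augmented = duplicate_subset(corpus, fraction=0.5, copies=3)
```

Training the same model on `corpus` and on `augmented` at increasing values of `fraction` and `copies` would give a rough picture of how sensitive it is to repetition.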
Full Text Available